Prediction and stratification of longitudinal risk for chronic obstructive pulmonary disease across smoking behaviors

Smoking is the leading risk factor for chronic obstructive pulmonary disease (COPD) worldwide, yet many people who never smoke develop COPD. We perform a longitudinal analysis of COPD in the UK Biobank to derive and validate the Socioeconomic and Environmental Risk Score, which captures additive and cumulative environmental, behavioral, and socioeconomic exposure risks beyond tobacco smoking. The Socioeconomic and Environmental Risk Score is more predictive of COPD than smoking status and pack-years. Individuals in the highest decile of the risk score have a greater risk for incident COPD compared to the remaining population. Never smokers in the highest decile of exposure risk are more likely to develop COPD than previous and current smokers in the lowest decile. In general, the prediction accuracy of the Socioeconomic and Environmental Risk Score is lower in non-European populations. While smoking status is often considered in screening for COPD, our findings highlight the importance of other non-smoking environmental and socioeconomic variables.

Reviewer #1 (Remarks to the Author): The noteworthy results from this study are that the authors have created a risk score (SERS) that moves beyond smoking (as the major risk) for time to COPD onset trained and evaluated on socioeconomic, environmental, and behaviour variables. The need for such a risk score is that 20%-30% of COPD cases worldwide consist of never smokers (with COPD likely caused by non-smoking exposures and genetic markers). The authors also managed to improve risk scoring within smoking status subgroups, where exposures in SERS allowed for stratification of low- and high-risk individuals. They also found that a composite genome-wide polygenic risk score had significantly lower predictive accuracy than smoking behaviours or SERS in their cohort.
This work is of significance as little has been published in terms of a COPD metric based on cumulative effects of socioeconomic and environmental exposures (beyond smoking). We also need to know if these risk scores work for non-European ancestry individuals.
This study used the 1/2m participant UK Biobank and included 320,115 individuals in their analysis.
The interpretability of the multivariable model results could be improved by the inclusion of confidence intervals in the second paragraph (as in para 3). From an epidemiological perspective the P values are less informative.
It is perhaps hard to interpret the various groups of exposures; more detail on these variables in the main body of the text (Methods) would be welcome (e.g., was NO2 categorical or numeric, and in what units?).
In the Results section, there are several paragraphs with a first sentence better placed in the introduction or methods. Short subsections and titles might be more appropriate and improve readability.
In the Discussion section, the third paragraph appears to repeat results. Provide a brief summary of these results in the first paragraph of the discussion.
There is some commentary in the discussion provided on the performance of the model. The authors should provide details and make comparison with the current prediction models in the discussion section: e.g. https://www.nature.com/articles/s41533-022-00280-0
Minors: Results section page para 3 line 4 - delete 'i' and line 6 '5.2 5'. Discussion section: missing full stop in para 2.
Reviewer #2 (Remarks to the Author): The authors developed the SERS to predict COPD, which was tested with stratification by smoking status and in combination with a composite genome-wide polygenic risk score (PGS). These are very noteworthy results, which emphasize socioeconomic and environmental factors for COPD development.
However, a review and discussion of the socioeconomic and environmental factors are necessary. In particular, the factors included in SERS should be discussed in terms of how they are associated with COPD development, e.g., not PM10 but NO2 became the factor to predict COPD. Bread type might be an unfamiliar factor for Asian populations, which limits generalization.
SERS is divided into quintiles for COPD prediction. Could you assign a score to each factor in SERS and set a cut-off total SERS score for COPD development? This might be much easier and more accessible for clinicians to use in the clinic.
Reviewer #3 (Remarks to the Author): In this large cohort study, the authors performed longitudinal analysis of COPD in the UK Biobank to develop the Socioeconomic and Environmental Risk Score (SERS) which captures additive and cumulative environmental, behavioral, and socioeconomic exposure risks beyond tobacco smoking. They reported that in addition to genetic factors, socioeconomic and environmental factors beyond smoking can predict and stratify COPD risk for both non-smoking and smoking individuals. The findings of this study might be useful in predicting the risk of COPD. However, the findings are not new and there are several methodological concerns that warrant attention. These are outlined in the comments below.
1. Quite a few prediction models for COPD have been studied/published before (e.g., PMID: 19720809; 31542453). Please compare your results with existing literature and highlight the novelty of the current study.
2. The authors claimed that the prediction accuracy of SERS was lower in the non-European populations compared to the European evaluation set. However, it should be noted that only a very small proportion of participants were non-European, which may explain the finding.
3. "We classified COPD based on a combination of linked hospital admission records for International Classification of Disease (ICD) 9 codes of 490, 491, 492, 494, 496 ICD-10 codes of J41.X, J43.X, J44.X, J98.2, J98.3, having a forced expiratory volume (FEV1)/forced vital capacity (FVC) ratio of < 0.70, or having self-reported COPD in an interview". To my knowledge, lung function test and interview were only conducted at baseline. The outcome of COPD events was thus mainly determined based on hospital admission records; this could be error-prone in a biobank setting. The majority of COPD patients would not go to hospital, thus the identified COPD cases were mainly severe patients. The authors need to discuss the use of medical records review as a limitation and discuss the specific problems related to using medical records as an outcome. One possible solution is to leverage additional information (e.g., medication) and further curate the data.
4. What is the incidence rate of COPD in the follow-up time? Is it comparable with the previous literature?
5. People with more resources are likely to be diagnosed with COPD earlier in life, whereas those with fewer resources are more likely to have a delayed diagnosis. One would expect that this would bias estimates.
6. The performance of SERS (C index = 0.770, 95% CI 0.756 to 0.784) is very close to smoking status (C index = 0.738, 95% CI 0.724 to 0.752) and pack-years (C index = 0.742, 95% CI 0.727 to 0.756). This implies that smoking is the most important single predictor of COPD. It would be of interest to see the performance of the combination of SERS and smoking.
7. How was the time of COPD onset determined, and how accurate is it? Please provide more details.
8. The authors should be aware that selection bias issues are very important in UK Biobank.
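Comment 6 compares the models by their C index. As a minimal sketch of how such a value can be computed, the Python below implements Harrell's concordance index for right-censored data. The function name and argument layout are hypothetical, and the manuscript does not state which estimator or software the authors used, so this is an illustration of the metric rather than their procedure.

```python
def harrell_c_index(times, events, risks):
    """Harrell's concordance index for right-censored survival data.

    times  : follow-up time for each individual
    events : 1 if the event (e.g. incident COPD) was observed, 0 if censored
    risks  : predicted risk score (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        # A pair is only comparable if the earlier time is an observed event.
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # higher risk had the earlier event
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to uninformative scores and 1.0 to perfect ranking, which is why differences such as 0.770 versus 0.738 are read alongside their confidence intervals.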
We thank the editor and the reviewers for taking the time to review our manuscript. We have responded in red to the reviewers' comments below:
Reviewer #1 (Remarks to the Author): The noteworthy results from this study are that the authors have created a risk score (SERS) that moves beyond smoking (as the major risk) for time to COPD onset trained and evaluated on socioeconomic, environmental, and behaviour variables. The need for such a risk score is that 20%-30% of COPD cases worldwide consist of never smokers (with COPD likely caused by non-smoking exposures and genetic markers). The authors also managed to improve risk scoring within smoking status subgroups, where exposures in SERS allowed for stratification of low- and high-risk individuals. They also found that a composite genome-wide polygenic risk score had significantly lower predictive accuracy than smoking behaviours or SERS in their cohort.
This work is of significance as little has been published in terms of a COPD metric based on cumulative effects of socioeconomic and environmental exposures (beyond smoking). We also need to know if these risk scores work for non-European ancestry individuals.
This study used the 1/2m participant UK Biobank and included 320,115 individuals in their analysis.
We thank the reviewer for their thoughtful comments and for finding the work novel and of significance.
The interpretability of the multivariable model results could be improved by the inclusion of confidence intervals in the second paragraph (as in para 3). From an epidemiological perspective the P values are less informative.
It is perhaps hard to interpret the various groups of exposures; more detail on these variables in the main body of the text (Methods) would be welcome (e.g., was NO2 categorical or numeric, and in what units?).
We apologize for the lack of clarity and have included the following additional description in the revised methods and Supplementary Tables section: Variables belonged to four data types: continuous, ordered categorical, unordered categorical, and binary (Supplementary Table 1).
We also included an additional column "Data Class" in Supplementary Table 1 describing the data class of each variable.
In the Results section, there are several paragraphs with a first sentence better placed in the introduction or methods. Short subsections and titles might be more appropriate and improve readability.
We thank the reviewer for their suggestion and have included the following subsections/titles in the revised results section to improve readability:
Baseline characteristics of the study population
Developing the COPD socioeconomic and environmental risk score (SERS)
SERS stratifies the risk of COPD in smoking and non-smoking populations
Combining genetic, environmental, and socioeconomic factors to predict COPD
Evaluate prediction models in diverse populations
In the Discussion section, the third paragraph appears to repeat results. Provide a brief summary of these results in the first paragraph of the discussion.
We agree with the reviewer and have removed repeated results from the revised discussion section.
There is some commentary in the discussion provided on the performance of the model. The authors should provide details and make comparison with the current prediction models in the discussion section: e.g. https://www.nature.com/articles/s41533-022-00280-0
We thank the reviewer for their suggestion. The Shah et al. paper references a progression model for predicting 10-year mortality in patients already diagnosed with COPD. In our revised discussion, we have included the following comparison of our model with previously proposed models for predicting future risk of COPD:
While risk models for COPD have been proposed 43, none, to our knowledge, have used longitudinal biobank-level data to assess the independent risk of smoking, socioeconomic, and environmental factors on incident COPD. For instance, Chen et al. 44 proposed a prediction model for FEV1 and FVC decline using data from four thousand participants in the Framingham Offspring Cohort. Their model is composed of 20 factors including pack years, laboratory blood measurements, and diseases and symptoms. Guo et al. 45 developed a COPD prediction model consisting of early life factors, genetic polymorphisms, and smoking history that was constructed using cross-sectional data from roughly 700 Chinese individuals.
We thank the reviewer for pointing this out and have corrected it in the revised manuscript.
Reviewer #2 (Remarks to the Author): The authors developed the SERS to predict COPD, which was tested with stratification by smoking status and in combination with a composite genome-wide polygenic risk score (PGS). These are very noteworthy results, which emphasize socioeconomic and environmental factors for COPD development.
However, a review and discussion of the socioeconomic and environmental factors are necessary. In particular, the factors included in SERS should be discussed in terms of how they are associated with COPD development, e.g., not PM10 but NO2 became the factor to predict COPD. Bread type might be an unfamiliar factor for Asian populations, which limits generalization.
We thank the reviewer for their suggestion and have elaborated on specific environmental factors in our revised discussion:
In this study, we used a data-driven approach to build the COPD SERS, which includes 11 indicators of alcohol, air pollution, diet, employment, household information, physical activity, and sociodemographic information, and captures holistic socioeconomic and environmental risk beyond smoking status. Our approach re-highlighted previously reported associations between COPD risk and socioeconomic and environmental factors such as air pollution 36,37, alcohol consumption 38,39, physical activity 40,41, and employment status 42. Other risk factors that showed strong associations with COPD in our univariate XWAS procedure were not included in our final SERS model as our approach implements a shrinkage and selection procedure that favors interpretability and independence over complexity. For example, while several correlated measures of air pollution such as PM2.5, PM10, NO and NO2 are established risk factors for COPD, previous studies have shown NO2 to have the highest risk for COPD 37. In our data-driven procedure, PM2.5, NO and NO2 were all significantly associated with COPD incidence in univariate association, but only NO2 was retained for the final multivariable model.
SERS is divided into quintiles for COPD prediction. Could you assign a score to each factor in SERS and set a cut-off total SERS score for COPD development? This might be much easier and more accessible for clinicians to use in the clinic.
We thank the reviewer for their suggestion and agree that SERS has the potential to guide surveillance procedures and facilitate personalized care. However, implementing a clinical risk calculator into care and decision medicine requires careful validation, potentially in a randomized setting. We have included this as a future direction in our revised discussion:
While SERS has the potential to guide surveillance and facilitate personalized care for COPD, our results should be replicated and carefully validated in datasets and randomized experiments with longer-term follow up that collect similar environmental and behavioral instruments. In addition to this validation, future work should replicate our results in larger datasets with more diverse characteristics and ancestry backgrounds, such as the All of Us Project.
Reviewer #3 (Remarks to the Author): In this large cohort study, the authors performed longitudinal analysis of COPD in the UK Biobank to develop the Socioeconomic and Environmental Risk Score (SERS) which captures additive and cumulative environmental, behavioral, and socioeconomic exposure risks beyond tobacco smoking. They reported that in addition to genetic factors, socioeconomic and environmental factors beyond smoking can predict and stratify COPD risk for both non-smoking and smoking individuals. The findings of this study might be useful in predicting the risk of COPD. However, the findings are not new and there are several methodological concerns that warrant attention. These are outlined in the comments below.
1. Quite a few prediction models for COPD have been studied/published before (e.g., PMID: 19720809; 31542453). Please compare your results with existing literature and highlight the novelty of the current study.
We thank the reviewer for their suggestion. The cited risk prediction models are valuable but have different hypotheses, objectives, and use cases than the model we develop here. PMID 19720809 references the COPD Assessment Test, which evaluates the impact of COPD on health status in patients with existing COPD. In other words, it models progression rather than onset, an important objective, but different from ours. In our revised discussion, we have included the following comparison of our model with previously proposed models for predicting future risk of COPD:
While risk models for COPD have been proposed 43, none have used longitudinal biobank-level data to assess the independent risk of socioeconomic and environmental factors on incident COPD. For instance, Chen et al. 44 modeled FEV1 and FVC decline using data from four thousand participants in the Framingham Offspring Cohort. Their model is composed of 20 factors including pack years, laboratory blood measurements, and diseases and symptoms. Guo et al. 45 developed a COPD prediction model consisting of early life factors, genetic polymorphisms, and smoking history that was constructed using cross-sectional data from roughly 700 Chinese individuals.
2. The authors claimed that the prediction accuracy of SERS was lower in the non-European populations compared to the European evaluation set. However, it should be noted that only a very small proportion of participants were non-European, which may explain the finding.
We agree with the reviewer that only a very small proportion of the study sample was non-European. In our analysis, we randomly subsetted 1,500 individuals from each of the four largest ancestry groups for a fairer comparison. In our discussion section, we additionally state:
We recognize, however, that the smaller sample size of non-European individuals in the UKB results in lower power and confidence in our conclusions, and that certain exposure factors, such as bread type, may not be familiar and generalizable to every population. Future work should replicate our results in larger datasets with more diverse characteristics and ancestry backgrounds, such as in the All of Us Research Program 48.
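The equal-size subsampling described in this response can be sketched as follows. This is an illustrative Python outline, not the authors' code: the function name, the record layout (a list of dictionaries with an ancestry field), and the seed handling are all assumptions made for the example.

```python
import random
from collections import defaultdict


def subsample_largest_groups(records, group_key, n_groups, n_per_group, seed=0):
    """Draw an equal-size random sample from each of the largest groups.

    records     : list of dicts, one per individual (hypothetical layout)
    group_key   : key holding the group label, e.g. an ancestry field
    n_groups    : how many of the largest groups to keep (4 in the response)
    n_per_group : sample size per group (1,500 in the response)
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Bucket individuals by group label.
    by_group = defaultdict(list)
    for record in records:
        by_group[record[group_key]].append(record)

    # Keep only the n_groups largest groups.
    largest = sorted(by_group, key=lambda g: len(by_group[g]), reverse=True)
    largest = largest[:n_groups]

    sampled = {}
    for group in largest:
        members = by_group[group]
        if len(members) < n_per_group:
            raise ValueError(f"group {group!r} has only {len(members)} members")
        # Sample without replacement so no individual is counted twice.
        sampled[group] = rng.sample(members, n_per_group)
    return sampled
```

Equalizing group sizes removes sample size as an explanation for differences in accuracy between groups, although it does not address the reduced precision that comes with evaluating on only 1,500 individuals per group.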

3. "We classified COPD based on a combination of linked hospital admission records for International Classification of Disease (ICD) 9 codes of 490, 491, 492, 494, 496 ICD-10 codes of J41.X, J43.X, J44.X, J98.2, J98.3, having a forced expiratory volume (FEV1)/forced vital capacity (FVC) ratio of < 0.70, or having self-reported COPD in an interview". To my knowledge, lung function test and interview were only conducted at baseline. The outcome of COPD events was thus mainly determined based on hospital admission records; this could be error-prone in a biobank setting. The majority of COPD patients would not go to hospital, thus the identified COPD cases were mainly severe patients. The authors need to discuss the use of medical records review as a limitation and discuss the specific problems related to using medical records as an outcome. One possible solution is to leverage additional information (e.g., medication) and further curate the data.
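The outcome definition quoted in this comment reads as a logical OR of hospital codes, spirometry, and self-report. A minimal Python sketch of that rule is given below. The function name and input format are hypothetical, and it assumes dotted code strings (for example "J44.1") with ICD-9 subcodes matched by their three-digit category; biobank exports may store codes differently, so the matching would need adapting.

```python
# Code lists copied from the quoted definition.
ICD9_CATEGORIES = {"490", "491", "492", "494", "496"}
ICD10_CATEGORIES = {"J41", "J43", "J44"}   # J41.X, J43.X, J44.X
ICD10_EXACT = {"J98.2", "J98.3"}
RATIO_THRESHOLD = 0.70                      # FEV1/FVC cut-off


def meets_copd_definition(icd9=(), icd10=(), fev1=None, fvc=None,
                          self_reported=False):
    """Return True if any one of the quoted criteria is satisfied."""
    # Hospital admission records, ICD-9 (assumed matched on category).
    if any(code.split(".")[0] in ICD9_CATEGORIES for code in icd9):
        return True
    # Hospital admission records, ICD-10.
    for code in icd10:
        if code in ICD10_EXACT or code.split(".")[0] in ICD10_CATEGORIES:
            return True
    # Spirometry: FEV1/FVC ratio below 0.70 (skipped if values are missing).
    if fev1 is not None and fvc:
        if fev1 / fvc < RATIO_THRESHOLD:
            return True
    # Self-reported COPD at interview.
    return bool(self_reported)
```

Because spirometry and interview were, as the reviewer notes, collected only at baseline, in follow-up only the two code-based branches can newly become true, which is the source of the concern about under-ascertainment of milder cases.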